{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluate on MIRACL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[MIRACL](https://project-miracl.github.io/) (Multilingual Information Retrieval Across a Continuum of Languages) is an WSDM 2023 Cup challenge that focuses on search across 18 different languages. They release a multilingual retrieval dataset containing the train and dev set for 16 “known languages” and only dev set for 2 “surprise languages”. The topics are generated by native speakers of each language, who also label the relevance between the topics and a given document list. You can found the dataset on HuggingFace."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: We highly recommend you to run the evaluation of MIRACL on GPU. For reference, it takes about an hour for the whole process on a 8xA100 40G node."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Installation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First install the libraries we are using:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "% pip install FlagEmbedding pytrec_eval"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the great number of passages and articles in the 18 languages. MIRACL is a resourceful dataset for training or evaluating multi-lingual model. The data can be downloaded from [Hugging Face](https://huggingface.co/datasets/miracl/miracl)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Language        | # of Passages | # of Articles |\n",
    "|:----------------|--------------:|--------------:|\n",
    "| Arabic (ar)     |     2,061,414 |       656,982 |\n",
    "| Bengali (bn)    |       297,265 |        63,762 |\n",
    "| English (en)    |    32,893,221 |     5,758,285 |\n",
    "| Spanish (es)    |    10,373,953 |     1,669,181 |\n",
    "| Persian (fa)    |     2,207,172 |       857,827 |\n",
    "| Finnish (fi)    |     1,883,509 |       447,815 |\n",
    "| French (fr)     |    14,636,953 |     2,325,608 |\n",
    "| Hindi (hi)      |       506,264 |       148,107 |\n",
    "| Indonesian (id) |     1,446,315 |       446,330 |\n",
    "| Japanese (ja)   |     6,953,614 |     1,133,444 |\n",
    "| Korean (ko)     |     1,486,752 |       437,373 |\n",
    "| Russian (ru)    |     9,543,918 |     1,476,045 |\n",
    "| Swahili (sw)    |       131,924 |        47,793 |\n",
    "| Telugu (te)     |       518,079 |        66,353 |\n",
    "| Thai (th)       |       542,166 |       128,179 |\n",
    "| Chinese (zh)    |     4,934,368 |     1,246,389 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "lang = \"en\"\n",
    "corpus = load_dataset(\"miracl/miracl-corpus\", lang, trust_remote_code=True)['train']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each passage in the corpus has three parts: `docid`, `title`, and `text`. In the structure of document with docid `x#y`, `x` indicates the id of Wikipedia article, and `y` is the number of passage within that article. The title is the name of the article with id `x` that passage belongs to. The text is the text body of the passage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'docid': '56672809#4',\n",
       " 'title': 'Glen Tomasetti',\n",
       " 'text': 'In 1967 Tomasetti was prosecuted after refusing to pay one sixth of her taxes on the grounds that one sixth of the federal budget was funding Australia\\'s military presence in Vietnam. In court she argued that Australia\\'s participation in the Vietnam War violated its international legal obligations as a member of the United Nations. Public figures such as Joan Baez had made similar protests in the USA, but Tomasetti\\'s prosecution was \"believed to be the first case of its kind in Australia\", according to a contemporary news report. Tomasetti was eventually ordered to pay the unpaid taxes.'}"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The qrels have following form:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "dev = load_dataset('miracl/miracl', lang, trust_remote_code=True)['dev']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'query_id': '0',\n",
       " 'query': 'Is Creole a pidgin of French?',\n",
       " 'positive_passages': [{'docid': '462221#4',\n",
       "   'text': \"At the end of World War II in 1945, Korea was divided into North Korea and South Korea with North Korea (assisted by the Soviet Union), becoming a communist government after 1946, known as the Democratic People's Republic, followed by South Korea becoming the Republic of Korea. China became the communist People's Republic of China in 1949. In 1950, the Soviet Union backed North Korea while the United States backed South Korea, and China allied with the Soviet Union in what was to become the first military action of the Cold War.\",\n",
       "   'title': 'Eighth United States Army'},\n",
       "  {'docid': '29810#23',\n",
       "   'text': 'The large size of Texas and its location at the intersection of multiple climate zones gives the state highly variable weather. The Panhandle of the state has colder winters than North Texas, while the Gulf Coast has mild winters. Texas has wide variations in precipitation patterns. El Paso, on the western end of the state, averages of annual rainfall, while parts of southeast Texas average as much as per year. Dallas in the North Central region averages a more moderate per year.',\n",
       "   'title': 'Texas'},\n",
       "  {'docid': '3716905#0',\n",
       "   'text': 'A French creole, or French-based creole language, is a creole language (contact language with native speakers) for which French is the \"lexifier\". Most often this lexifier is not modern French but rather a 17th-century koiné of French from Paris, the French Atlantic harbors, and the nascent French colonies. French-based creole languages are spoken natively by millions of people worldwide, primarily in the Americas and on archipelagos throughout the Indian Ocean. This article also contains information on French pidgin languages, contact languages that lack native speakers.',\n",
       "   'title': 'French-based creole languages'},\n",
       "  {'docid': '22399755#18',\n",
       "   'text': 'There are many hypotheses on the origins of Haitian Creole. Linguist John Singler suggests that it most likely emerged under French control in colonial years when shifted its economy focused heavily on sugar production. This resulted in a much larger population of enslaved Africans, whose interaction with the French created the circumstances for the dialect to evolve from a pidgin to a Creole. His research and the research of Claire Lefebvre of the Université du Québec à Montréal suggests that Creole, despite drawing 90% of its lexicon from French, is the syntactic cousin of Fon, a Gbe language of the Niger-Congo family spoken in Benin. At the time of the emergence of Haitian Creole, 50% of the enslaved Africans in Haiti were Gbe speakers.',\n",
       "   'title': 'Haitian literature'}],\n",
       " 'negative_passages': [{'docid': '1170520#2',\n",
       "   'text': 'Louisiana Creole is a contact language that arose in the 18th century from interactions between speakers of the lexifier language of Standard French and several substrate or adstrate languages from Africa. Prior to its establishment as a Creole, the precursor was considered a pidgin language. The social situation that gave rise to the Louisiana Creole language was unique, in that the lexifier language was the language found at the contact site. More often the lexifier is the language that arrives at the contact site belonging to the substrate/adstrate languages. Neither the French, the French-Canadians, nor the African slaves were native to the area; this fact categorizes Louisiana Creole as a contact language that arose between exogenous ethnicities. Once the pidgin tongue was transmitted to the next generation as a \"lingua franca\" (who were then considered the first native speakers of the new grammar), it could effectively be classified as a creole language.',\n",
       "   'title': 'Louisiana Creole'},\n",
       "  {'docid': '49823#1',\n",
       "   'text': 'The precise number of creole languages is not known, particularly as many are poorly attested or documented. About one hundred creole languages have arisen since 1500. These are predominantly based on European languages such as English and French due to the European Age of Discovery and the Atlantic slave trade that arose at that time. With the improvements in ship-building and navigation, traders had to learn to communicate with people around the world, and the quickest way to do this was to develop a pidgin, or simplified language suited to the purpose; in turn, full creole languages developed from these pidgins. In addition to creoles that have European languages as their base, there are, for example, creoles based on Arabic, Chinese, and Malay. The creole with the largest number of speakers is Haitian Creole, with almost ten million native speakers, followed by Tok Pisin with about 4 million, most of whom are second-language speakers.',\n",
       "   'title': 'Creole language'},\n",
       "  {'docid': '1651722#10',\n",
       "   'text': 'Krio is an English-based creole from which descend Nigerian Pidgin English and Cameroonian Pidgin English and Pichinglis. It is also similar to English-based creole languages spoken in the Americas, especially the Gullah language, Jamaican Patois (Jamaican Creole), and Bajan Creole but it has its own distinctive character. It also shares some linguistic similarities with non-English creoles, such as the French-based creole languages in the Caribbean.',\n",
       "   'title': 'Krio language'},\n",
       "  {'docid': '540382#4',\n",
       "   'text': 'Until recently creoles were considered \"degenerate\" dialects of Portuguese unworthy of attention. As a consequence, there is little documentation on the details of their formation. Since the 20th century, increased study of creoles by linguists led to several theories being advanced. The monogenetic theory of pidgins assumes that some type of pidgin language — dubbed West African Pidgin Portuguese — based on Portuguese was spoken from the 15th to 18th centuries in the forts established by the Portuguese on the West African coast. According to this theory, this variety may have been the starting point of all the pidgin and creole languages. This may explain to some extent why Portuguese lexical items can be found in many creoles, but more importantly, it would account for the numerous grammatical similarities shared by such languages, such as the preposition \"na\", meaning \"in\" and/or \"on\", which would come from the Portuguese contraction \"na\" meaning \"in the\" (feminine singular).',\n",
       "   'title': 'Portuguese-based creole languages'},\n",
       "  {'docid': '49823#7',\n",
       "   'text': 'Other scholars, such as Salikoko Mufwene, argue that pidgins and creoles arise independently under different circumstances, and that a pidgin need not always precede a creole nor a creole evolve from a pidgin. Pidgins, according to Mufwene, emerged in trade colonies among \"users who preserved their native vernaculars for their day-to-day interactions.\" Creoles, meanwhile, developed in settlement colonies in which speakers of a European language, often indentured servants whose language would be far from the standard in the first place, interacted extensively with non-European slaves, absorbing certain words and features from the slaves\\' non-European native languages, resulting in a heavily basilectalized version of the original language. These servants and slaves would come to use the creole as an everyday vernacular, rather than merely in situations in which contact with a speaker of the superstrate was necessary.',\n",
       "   'title': 'Creole language'},\n",
       "  {'docid': '11236157#2',\n",
       "   'text': 'While many creoles around the world have lexicons based on languages other than Portuguese (e.g. English, French, Spanish, Dutch), it was hypothesized that such creoles were derived from this lingua franca by means of relexification, i.e. the process in which a pidgin or creole incorporates a significant amount of its lexicon from another language while keeping the grammar intact. There is some evidence that relexification is a real process. Pieter Muysken and show that there are languages which derive their grammar and lexicon from two different languages respectively, which could be easily explained with the relexification hypothesis. Also, Saramaccan seems to be a pidgin frozen in the middle of relexification from Portuguese to English. However, in cases of such mixed languages, as call them, there is never a one-to-one relationship between the grammar or lexicon of the mixed language and the grammar or lexicon of the language they attribute it to.',\n",
       "   'title': 'Monogenetic theory of pidgins'},\n",
       "  {'docid': '1612877#8',\n",
       "   'text': 'A mixed language differs from pidgins, creoles and code-switching in very fundamental ways. In most cases, mixed language speakers are fluent, even native, speakers of both languages; however, speakers of Michif (a N-V mixed language) are unique in that many are not fluent in both of the sources languages. Pidgins, on the other hand, develop in a situation, usually in the context of trade, where speakers of two (or more) different languages come into contact and need to find some way to communicate with each other. Creoles develop when a pidgin language becomes a first language for young speakers. While creoles tend to have drastically simplified morphologies, mixed languages often retain the inflectional complexities of one, or both, of parent languages. For instance, Michif retains the complexities of its French nouns and its Cree verbs.',\n",
       "   'title': 'Mixed language'},\n",
       "  {'docid': '9606120#4',\n",
       "   'text': 'While it is classified as a pidgin language, this is inaccurate. Speakers are already fluent in either English and French, and as such it is not used in situations where both parties lack a common tongue. As a whole, Camfranglais sets itself apart from other pidgins and creoles in that it consists of an array of languages, at least one of which is already known by those speaking it. For instance, while it contains elements of borrowing, code-switching, and pidgin languages, it is not a contact language as both parties can be presumed to speak French, the lexifer. Numerous other classifications have been proposed, like ‘pidgin’, ‘argot’, ‘youth language’, a ‘sabir camerounais’, an ‘appropriation vernaculaire du français’ or a ‘hybrid slang’. However, as Camfranglais is more developed than a slang, this too is insufficient. Kießling proposes it be classified as a \\'highly hybrid sociolect of the urban youth type\", a definition that Stein-Kanjora agrees with.',\n",
       "   'title': 'Camfranglais'}]}"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dev[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each item has four parts: `query_id`, `query`, `positive_passages`, and `negative_passages`. Here, `query_id` and `query` correspond to the id and text content of the qeury. `positive_passages` and `negative_passages` are list of passages with their corresponding `docid`, `title`, and `text`. \n",
    "\n",
    "This structure is the same in the `train`, `dev`, `testA` and `testB` sets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we process the ids and text of queries and corpus, and get the qrels of the dev set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus_ids = corpus['docid']\n",
    "corpus_text = []\n",
    "for doc in corpus:\n",
    "   corpus_text.append(f\"{doc['title']} {doc['text']}\".strip())\n",
    "\n",
    "queries_ids = dev['query_id']\n",
    "queries_text = dev['query']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Evaluate from scratch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the demo we use bge-base-en-v1.5, feel free to change to the model you prefer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os \n",
    "os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'\n",
    "os.environ['SETUPTOOLS_USE_DISTUTILS'] = ''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "initial target device: 100%|██████████| 8/8 [00:29<00:00,  3.66s/it]\n",
      "pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 52.84it/s]\n",
      "pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 55.15it/s]\n",
      "pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 56.49it/s]\n",
      "pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 55.22it/s]\n",
      "pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 49.22it/s]\n",
      "pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 54.69it/s]\n",
      "pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 49.16it/s]\n",
      "pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 50.77it/s]\n",
      "Chunks: 100%|██████████| 8/8 [00:10<00:00,  1.27s/it]\n",
      "pre tokenize: 100%|██████████| 16062/16062 [08:12<00:00, 32.58it/s]  \n",
      "pre tokenize: 100%|██████████| 16062/16062 [08:44<00:00, 30.60it/s]68s/it]\n",
      "pre tokenize: 100%|██████████| 16062/16062 [08:39<00:00, 30.90it/s]41s/it]\n",
      "pre tokenize: 100%|██████████| 16062/16062 [09:04<00:00, 29.49it/s]43s/it]\n",
      "pre tokenize: 100%|██████████| 16062/16062 [09:27<00:00, 28.29it/s]it/s]t]\n",
      "pre tokenize: 100%|██████████| 16062/16062 [09:08<00:00, 29.30it/s]32s/it]\n",
      "pre tokenize: 100%|██████████| 16062/16062 [08:59<00:00, 29.77it/s]it/s]t]\n",
      "pre tokenize: 100%|██████████| 16062/16062 [09:04<00:00, 29.50it/s]29s/it]\n",
      "Inference Embeddings: 100%|██████████| 16062/16062 [17:10<00:00, 15.59it/s] \n",
      "Inference Embeddings: 100%|██████████| 16062/16062 [17:04<00:00, 15.68it/s]]\n",
      "Inference Embeddings: 100%|██████████| 16062/16062 [17:01<00:00, 15.72it/s]s]\n",
      "Inference Embeddings: 100%|██████████| 16062/16062 [17:28<00:00, 15.32it/s]\n",
      "Inference Embeddings: 100%|██████████| 16062/16062 [17:43<00:00, 15.10it/s]\n",
      "Inference Embeddings: 100%|██████████| 16062/16062 [17:27<00:00, 15.34it/s]\n",
      "Inference Embeddings: 100%|██████████| 16062/16062 [17:36<00:00, 15.20it/s]\n",
      "Inference Embeddings: 100%|██████████| 16062/16062 [17:31<00:00, 15.28it/s]\n",
      "Chunks: 100%|██████████| 8/8 [27:49<00:00, 208.64s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "shape of the embeddings: (32893221, 768)\n",
      "data type of the embeddings:  float16\n"
     ]
    }
   ],
   "source": [
    "from FlagEmbedding import FlagModel\n",
    "\n",
    "# get the BGE embedding model\n",
    "model = FlagModel('BAAI/bge-base-en-v1.5')\n",
    "\n",
    "# get the embedding of the queries and corpus\n",
    "queries_embeddings = model.encode_queries(queries_text)\n",
    "corpus_embeddings = model.encode_corpus(corpus_text)\n",
    "\n",
    "print(\"shape of the embeddings:\", corpus_embeddings.shape)\n",
    "print(\"data type of the embeddings: \", corpus_embeddings.dtype)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a Faiss index to store the embeddings."
   ]
  },
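  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The index below uses inner product as its metric. As a brief aside, and assuming the embeddings are L2-normalized (believed to be the default for `FlagModel`, but worth verifying for the model you use), the inner product of a query embedding $q$ and a passage embedding $d$ equals their cosine similarity. The symbols $q$, $d$, and $\\theta$ (the angle between the two vectors) are introduced here only for illustration:\n",
    "\n",
    "$$\\langle q, d \\rangle = \\lVert q \\rVert \\, \\lVert d \\rVert \\cos\\theta = \\cos\\theta \\quad \\text{when } \\lVert q \\rVert = \\lVert d \\rVert = 1$$\n",
    "\n",
    "If the embeddings are not normalized, inner-product scores also depend on vector length, so the ranking may differ from a cosine-similarity ranking."
   ]
  },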
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total number of vectors: 32893221\n"
     ]
    }
   ],
   "source": [
    "import faiss\n",
    "import numpy as np\n",
    "\n",
    "# get the length of our embedding vectors, vectors by bge-base-en-v1.5 have length 768\n",
    "dim = corpus_embeddings.shape[-1]\n",
    "\n",
    "# create the faiss index and store the corpus embeddings into the vector space\n",
    "index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)\n",
    "corpus_embeddings = corpus_embeddings.astype(np.float32)\n",
    "# train and add the embeddings to the index\n",
    "index.train(corpus_embeddings)\n",
    "index.add(corpus_embeddings)\n",
    "\n",
    "print(f\"total number of vectors: {index.ntotal}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Searching"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the Faiss index to search for each query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Searching: 100%|██████████| 25/25 [15:03<00:00, 36.15s/it]\n"
     ]
    }
   ],
   "source": [
    "from tqdm import tqdm\n",
    "\n",
    "query_size = len(queries_embeddings)\n",
    "\n",
    "all_scores = []\n",
    "all_indices = []\n",
    "\n",
    "for i in tqdm(range(0, query_size, 32), desc=\"Searching\"):\n",
    "    j = min(i + 32, query_size)\n",
    "    query_embedding = queries_embeddings[i: j]\n",
    "    score, indice = index.search(query_embedding.astype(np.float32), k=100)\n",
    "    all_scores.append(score)\n",
    "    all_indices.append(indice)\n",
    "\n",
    "all_scores = np.concatenate(all_scores, axis=0)\n",
    "all_indices = np.concatenate(all_indices, axis=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then map the search results back to the indices in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = {}\n",
    "for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):\n",
    "    results[queries_ids[idx]] = {}\n",
    "    for score, index in zip(scores, indices):\n",
    "        if index != -1:\n",
    "            results[queries_ids[idx]][corpus_ids[index]] = float(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 Evaluating"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download the qrels file for evaluation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "--2024-11-21 10:26:16--  https://hf-mirror.com/datasets/miracl/miracl/resolve/main/miracl-v1.0-en/qrels/qrels.miracl-v1.0-en-dev.tsv\n",
      "Resolving hf-mirror.com (hf-mirror.com)... 153.121.57.40, 133.242.169.68, 160.16.199.204\n",
      "Connecting to hf-mirror.com (hf-mirror.com)|153.121.57.40|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 167817 (164K) [text/plain]\n",
      "Saving to: ‘qrels.miracl-v1.0-en-dev.tsv’\n",
      "\n",
      "     0K .......... .......... .......... .......... .......... 30%  109K 1s\n",
      "    50K .......... .......... .......... .......... .......... 61% 44.5K 1s\n",
      "   100K .......... .......... .......... .......... .......... 91% 69.6K 0s\n",
      "   150K .......... ...                                        100% 28.0K=2.8s\n",
      "\n",
      "2024-11-21 10:26:20 (58.6 KB/s) - ‘qrels.miracl-v1.0-en-dev.tsv’ saved [167817/167817]\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "endpoint = os.getenv('HF_ENDPOINT', 'https://huggingface.co')\n",
    "file_name = \"qrels.miracl-v1.0-en-dev.tsv\"\n",
    "qrel_url = f\"wget {endpoint}/datasets/miracl/miracl/resolve/main/miracl-v1.0-en/qrels/{file_name}\"\n",
    "\n",
    "os.system(qrel_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the qrels from the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "qrels_dict = {}\n",
    "with open(file_name, \"r\", encoding=\"utf-8\") as f:\n",
    "    for line in f.readlines():\n",
    "        qid, _, docid, rel = line.strip().split(\"\\t\")\n",
    "        qid, docid, rel = str(qid), str(docid), int(rel)\n",
    "        if qid not in qrels_dict:\n",
    "            qrels_dict[qid] = {}\n",
    "        qrels_dict[qid][docid] = rel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, use [pytrec_eval](https://github.com/cvangysel/pytrec_eval) library to help us calculate the scores of selected metrics:"
   ]
  },
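  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the two metrics reported below can be written as follows. This is a sketch that assumes the standard trec_eval definitions with linear gain, which pytrec_eval wraps; the symbols are introduced here only for illustration. For a query with relevant set $R$ and a ranked list whose $i$-th document has relevance label $rel_i$:\n",
    "\n",
    "$$\\mathrm{Recall@}k = \\frac{|R \\cap \\text{top-}k|}{|R|}, \\qquad \\mathrm{DCG@}k = \\sum_{i=1}^{k} \\frac{rel_i}{\\log_2(i+1)}, \\qquad \\mathrm{NDCG@}k = \\frac{\\mathrm{DCG@}k}{\\mathrm{IDCG@}k}$$\n",
    "\n",
    "Here $\\mathrm{IDCG@}k$ is the $\\mathrm{DCG@}k$ of an ideal ranking that places the most relevant documents first. The reported scores are these per-query values averaged over all evaluated queries."
   ]
  },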
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "defaultdict(<class 'list'>, {'NDCG@10': 0.46073, 'NDCG@100': 0.54336})\n",
      "defaultdict(<class 'list'>, {'Recall@10': 0.55972, 'Recall@100': 0.83827})\n"
     ]
    }
   ],
   "source": [
    "import pytrec_eval\n",
    "from collections import defaultdict\n",
    "\n",
    "ndcg_string = \"ndcg_cut.\" + \",\".join([str(k) for k in [10,100]])\n",
    "recall_string = \"recall.\" + \",\".join([str(k) for k in [10,100]])\n",
    "\n",
    "evaluator = pytrec_eval.RelevanceEvaluator(\n",
    "    qrels_dict, {ndcg_string, recall_string}\n",
    ")\n",
    "scores = evaluator.evaluate(results)\n",
    "\n",
    "all_ndcgs, all_recalls = defaultdict(list), defaultdict(list)\n",
    "for query_id in scores.keys():\n",
    "    for k in [10,100]:\n",
    "        all_ndcgs[f\"NDCG@{k}\"].append(scores[query_id][\"ndcg_cut_\" + str(k)])\n",
    "        all_recalls[f\"Recall@{k}\"].append(scores[query_id][\"recall_\" + str(k)])\n",
    "\n",
    "ndcg, recall = (\n",
    "    all_ndcgs.copy(),\n",
    "    all_recalls.copy(),\n",
    ")\n",
    "\n",
    "for k in [10,100]:\n",
    "    ndcg[f\"NDCG@{k}\"] = round(sum(ndcg[f\"NDCG@{k}\"]) / len(scores), 5)\n",
    "    recall[f\"Recall@{k}\"] = round(sum(recall[f\"Recall@{k}\"]) / len(scores), 5)\n",
    "\n",
    "print(ndcg)\n",
    "print(recall)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Evaluate using FlagEmbedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We provide independent evaluation for popular datasets and benchmarks. Try the following code to run the evaluation, or run the shell script provided in [example](../../examples/evaluation/miracl/eval_miracl.sh) folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "arguments = \"\"\"- \\\n",
    "    --eval_name miracl \\\n",
    "    --dataset_dir ./miracl/data \\\n",
    "    --dataset_names en \\\n",
    "    --splits dev \\\n",
    "    --corpus_embd_save_dir ./miracl/corpus_embd \\\n",
    "    --output_dir ./miracl/search_results \\\n",
    "    --search_top_k 100 \\\n",
    "    --cache_path ./cache/data \\\n",
    "    --overwrite True \\\n",
    "    --k_values 10 100 \\\n",
    "    --eval_output_method markdown \\\n",
    "    --eval_output_path ./miracl/miracl_eval_results.md \\\n",
    "    --eval_metrics ndcg_at_10 recall_at_100 \\\n",
    "    --embedder_name_or_path BAAI/bge-base-en-v1.5 \\\n",
    "    --devices cuda:0 cuda:1 \\\n",
    "    --embedder_batch_size 1024\n",
    "\"\"\".replace('\\n','')\n",
    "\n",
    "sys.argv = arguments.split()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/root/anaconda3/envs/dev/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "initial target device: 100%|██████████| 2/2 [00:09<00:00,  4.98s/it]\n",
      "pre tokenize: 100%|██████████| 16062/16062 [18:01<00:00, 14.85it/s]  \n",
      "You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n",
      "/root/anaconda3/envs/dev/lib/python3.12/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml\n",
      "  warnings.warn(\n",
      "pre tokenize: 100%|██████████| 16062/16062 [18:44<00:00, 14.29it/s]92s/it]\n",
      "Inference Embeddings:   0%|          | 42/16062 [00:54<8:28:19,  1.90s/it]You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n",
      "Inference Embeddings:   0%|          | 43/16062 [00:56<8:22:03,  1.88s/it]/root/anaconda3/envs/dev/lib/python3.12/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml\n",
      "  warnings.warn(\n",
      "Inference Embeddings: 100%|██████████| 16062/16062 [48:29<00:00,  5.52it/s] \n",
      "Inference Embeddings: 100%|██████████| 16062/16062 [48:55<00:00,  5.47it/s]\n",
      "Chunks: 100%|██████████| 2/2 [1:10:57<00:00, 2128.54s/it]  \n",
      "pre tokenize: 100%|██████████| 1/1 [00:11<00:00, 11.06s/it]\n",
      "pre tokenize: 100%|██████████| 1/1 [00:12<00:00, 12.72s/it]\n",
      "Inference Embeddings: 100%|██████████| 1/1 [00:00<00:00, 32.15it/s]\n",
      "Inference Embeddings: 100%|██████████| 1/1 [00:00<00:00, 39.80it/s]\n",
      "Chunks: 100%|██████████| 2/2 [00:31<00:00, 15.79s/it]\n",
      "Searching: 100%|██████████| 25/25 [00:00<00:00, 26.24it/s]\n",
      "Qrels not found in ./miracl/data/en/dev_qrels.jsonl. Trying to download the qrels from the remote and save it to ./miracl/data/en.\n",
      "--2024-11-20 13:00:40--  https://hf-mirror.com/datasets/miracl/miracl/resolve/main/miracl-v1.0-en/qrels/qrels.miracl-v1.0-en-dev.tsv\n",
      "Resolving hf-mirror.com (hf-mirror.com)... 133.242.169.68, 153.121.57.40, 160.16.199.204\n",
      "Connecting to hf-mirror.com (hf-mirror.com)|133.242.169.68|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 167817 (164K) [text/plain]\n",
      "Saving to: ‘./cache/data/miracl/qrels.miracl-v1.0-en-dev.tsv’\n",
      "\n",
      "     0K .......... .......... .......... .......... .......... 30%  336K 0s\n",
      "    50K .......... .......... .......... .......... .......... 61%  678K 0s\n",
      "   100K .......... .......... .......... .......... .......... 91%  362K 0s\n",
      "   150K .......... ...                                        100% 39.8K=0.7s\n",
      "\n",
      "2024-11-20 13:00:42 (231 KB/s) - ‘./cache/data/miracl/qrels.miracl-v1.0-en-dev.tsv’ saved [167817/167817]\n",
      "\n",
      "Loading and Saving qrels: 100%|██████████| 8350/8350 [00:00<00:00, 184554.95it/s]\n"
     ]
    }
   ],
   "source": [
    "from transformers import HfArgumentParser\n",
    "\n",
    "from FlagEmbedding.evaluation.miracl import (\n",
    "    MIRACLEvalArgs, MIRACLEvalModelArgs,\n",
    "    MIRACLEvalRunner\n",
    ")\n",
    "\n",
    "\n",
    "parser = HfArgumentParser((\n",
    "    MIRACLEvalArgs,\n",
    "    MIRACLEvalModelArgs\n",
    "))\n",
    "\n",
    "eval_args, model_args = parser.parse_args_into_dataclasses()\n",
    "eval_args: MIRACLEvalArgs\n",
    "model_args: MIRACLEvalModelArgs\n",
    "\n",
    "runner = MIRACLEvalRunner(\n",
    "    eval_args=eval_args,\n",
    "    model_args=model_args\n",
    ")\n",
    "\n",
    "runner.run()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"en-dev\": {\n",
      "        \"ndcg_at_10\": 0.46053,\n",
      "        \"ndcg_at_100\": 0.54313,\n",
      "        \"map_at_10\": 0.35928,\n",
      "        \"map_at_100\": 0.38726,\n",
      "        \"recall_at_10\": 0.55972,\n",
      "        \"recall_at_100\": 0.83809,\n",
      "        \"precision_at_10\": 0.14018,\n",
      "        \"precision_at_100\": 0.02347,\n",
      "        \"mrr_at_10\": 0.54328,\n",
      "        \"mrr_at_100\": 0.54929\n",
      "    }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "with open('miracl/search_results/bge-base-en-v1.5/NoReranker/EVAL/eval_results.json', 'r') as content_file:\n",
    "    print(content_file.read())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dev",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
